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CONDITIONAL PREDICTIVE INFERENCE 
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We give a finite-sample analysis of predictive inference procedures 
after model selection in regression with random design. The analysis 
is focused on a statistically challenging scenario where the number 
of potentially important explanatory variables can be infinite, where 
no regularity conditions are imposed on unknown parameters, where 
the number of explanatory variables in a "good" model can be of 
the same order as sample size and where the number of candidate 
models can be of larger order than sample size. The performance of 
inference procedures is evaluated conditional on the training sample. 
Under weak conditions on only the number of candidate models and 
on their complexity, and uniformly over all data-generating processes 
under consideration, we show that a certain prediction interval is ap- 
proximately valid and short with high probability in finite samples, 
in the sense that its actual coverage probability is close to the nom- 
inal one and in the sense that its length is close to the length of an 
infeasible interval that is constructed by actually knowing the "best" 
candidate model. Similar results are shown to hold for predictive in- 
ference procedures other than prediction intervals like, for example, 
tests of whether a future response will lie above or below a given 
threshold. 

1. Introduction. 

1.1. Motivation and summary. This paper is about inference on future 
observations based on a model that has been selected on the basis of the 
data and then fitted to the same data. We focus, in particular, on situations 
where the number of candidate models is large and where the number of 
explanatory variables in a "good" model can be large as well, in relation 
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to sample size. Such a situation is faced, for example, by Stenbakken and 
Souders [31] who predict the performance of analog/digital converters from 
partial measurements by selecting 64 explanatory variables (measurements) 
from a total of 8192 based on a sample of size 88; further examples include 
[1, 8, 12, 30, 33, 34, 35] and [37]. Note that, in these studies, the model that is 
selected, on the basis of the data, is often quite complex in relation to sample 
size, in the sense that the number of explanatory variables in the selected 
model and the sample size are of the same order of magnitude. Also note 
that the total number of candidate models in these studies exceeds sample 
size by several orders of magnitude. In such situations, inferential tools that 
assess the predictors' accuracy like, for example, the mean-squared error of 
the predictor, or prediction intervals, are needed. 

We consider a Gaussian regression model with random design, where the 
number of explanatory variables can be infinite, and where no regularity 
conditions are imposed on the unknown parameters. We use a variant of 
generalized cross-validation to evaluate the performance of candidate mod- 
els for prediction out-of-sample,^ to select a "good" model and to conduct 
predictive inference based on the selected model. The performance of the re- 
sulting model selector and the quality of predictive inference procedures are 
evaluated conditional on the training sample. We describe the performance 
of these methods by explicit finite-sample performance bounds. For exam- 
ple, we show that the proposed prediction interval is approximately valid 
and short, with high probability, even in statistically challenging situations 
where the number of explanatory variables in a "good" model is of the same 
order as sample size, and where the total number of candidate models is of 
a larger order than sample size. Here, approximately valid means that the 
prediction interval's actual coverage probability is close to the nominal one, 
and approximately short means that its length is close to the length of a 
certain infeasible "prediction interval" that is based on actually knowing the 
"best" candidate model. Our results hold uniformly over all data-generating 
processes under consideration. 

1.2. Our results in broader context. In the literature, results on predic- 
tive inference after model selection are scarce.^ The finite-sample distribu- 
tion of a linear predictor based on the selected model can be computed 
explicitly in sufficiently simple settings (see Leeb [18] and [19]). However, 



^Here, prediction "out-of-sample" means prediction of new responses given hitherto 
unobserved explanatory variables, whereas "in-sample" prediction means prediction of 
new responses for the same explanatory variables as observed in the training data. 

^This is in spite of the fact that predictive inference by itself is a rather well researched 
field (see, e.g., [3] and [9] for a frequentist and a Bayesian approach, respectively, as well 
as the references given there). 
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these results only allow for rather restricted collections of candidate models; 
moreover, as the number of candidate models increases, the resulting formu- 
lae get increasingly complicated and computationally infeasible. From the 
perspective of traditional large-sample analyses, on the other hand, predic- 
tive inference after model selection is typically rather trivial. Consider, for 
example, a parametric linear model where the response is a linear function 
of a finite number of explanatory variables and a random disturbance. Un- 
der standard assumptions, every sensible model selection procedure typically 
leads to a post model selection estimator that is consistent, even uniformly 
consistent (see, e.g.. Propositions A. 9 and B.l in [22]). In large samples, 
the random disturbance is therefore the dominant source of error when pre- 
dicting a new response. Thus, as far as prediction is concerned, all sensible 
model selection procedures perform alike in parametric settings in the large- 
sample limit. The same is true in nonparametric settings for appropriately 
chosen estimators of the true regression function, provided that the regres- 
sion function is a priori restricted to a sufficiently regular family like, say, a 
Besov body or a collection thereof, as it is often considered in nonparamet- 
ric function estimation. In situations where the true regression parameter or 
function can be estimated consistently or uniformly consistently, research is 
typically focused on finding estimators with good convergence rates, or on 
finding confidence sets for the true regression parameter or function that are 
valid and small. 

In this paper, we consider predictive inference after model selection in a 
situation that is difficult to analyze by exact finite-sample results or by large- 
sample limit theory. In particular, we focus on the statistically challenging 
scenario where the number of explanatory variables in a "good" model can 
be of the same order as sample size, and where the number of candidate 
models can be of larger order than sample size. This situation is typically 
too complex for an exact finite-sample analysis. Also, this situation is such 
that large-sample limit approximations cannot be guaranteed to be accurate 
in most cases. We do not rule out the case where: (i) a very simple model fits 
well and (ii) the number of candidate models is small, in relation to sample 
size. However, our results are most interesting in the case where one of these 
two conditions is not met. 

There are, however, a couple of results, in both parametric and nonpara- 
metric settings, that indicate that inference after model selection is a hard 
problem that is subject to certain insurmountable obstructions. Most of 
these results consider inference on the regression parameter itself, on com- 
ponents thereof or on the mean of a future response. 

Consider, first, a parametric linear model, with Gaussian errors and fixed 
design (under standard assumptions), and a linear predictor that is con- 
structed based on the outcome of a data-driven model selection step. It is 
well known that the distribution of such a linear predictor, properly scaled 
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and centered, typically depends on unknown parameters in a nontrivial way 
and can be highly nonnormal, regardless of sample size (see [21, 27]). 3 More- 
over, Leeb and Potscher [23, 24] showed that the distribution of such a linear 
predictor cannot be estimated in a uniformly consistent fashion, except in 
degenerate and trivial cases. Concerning confidence intervals for the mean 
of a future response, the results of Joshi [15] entail in the known-variance 
case that the standard interval based on fitting the overall model is admis- 
sible and uniquely minimax with respect to a loss function that measures 
both coverage probability and interval size (in the class of all randomized 
Lebesgue measurable confidence sets and up to trivial equivalences). (See 
also [17] for further references and results on confidence intervals post model 
selection.) 

In nonparametric function estimation, there are well-known limits to the 
adaptivity of honest L2 confidence balls for the true regression function. 
Here, "honest" means that the confidence ball guarantees coverage proba- 
bility over the whole function space under consideration, and "adaptivity" 
means that smoother regression functions (i.e., functions that belong to re- 
stricted submodels) are covered by smaller balls. In essence, larger function 
spaces limit the amount of adaptivity possible. This was first discovered by 
Li [25] and was further analyzed in [2, 4, 6, 7, 10, 14, 16] and [28]. Moreover, 
Baraud [2] also shows that honest and short confidence balls are feasible only 
if the error variance is assumed to lie in some bounded subset of (0, 00), and 
that loose variance bounds close to zero or infinity lead to large confidence 
balls. If an honest confidence band (i.e., an L^o confidence ball) is desired, 
then the limits to adaptivity are even more pronounced (see [11]). 

In the setting of this paper, where the goal is prediction out-of-sample, we 
demonstrate that prediction intervals post model selection can be simultane- 
ously valid and short in an approximate sense and with high probability, ir- 
respective of unknown parameters. The proposed prediction interval has the 
following two properties, except on an event whose probability is bounded 
by the expression in (1), which follows: (i) Its actual coverage probability 
is close to the nominal coverage probability, (ii) Its length is close to the 
length of a certain infeasible shortest possible interval that is constructed 
from actually knowing the "best" candidate model. These statements hold 
uniformly over all data-generating processes under consideration. 

On a technical level, this paper is related to Breiman and Freedman [5] 
in two regards: First, the model considered in this paper contains the model 



^This fact is at odds with a result of Shen, Huang and Ye [29], which claims that the 
limit distribution of a post model selection estimator in a parametric setting is normal 
with mean zero and estimable variance/covariance matrix (see Theorems 3 and 4 in that 
paper). Inspection of that paper reveals that the proof of Theorem 3 is in error as it 
stands, and that said theorem does not hold as claimed. Private communication with the 
authors has confirmed this. 
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considered in [5] as a special case [see the discussion following (2)], and, 
second, our results rely on a corresponding extension of Theorem 1.3 of [5] 
(see Proposition 2.1 and the attending discussion). The results derived here, 
however, differ considerably from those of [5] in terms of scope and content. 
We allow for families of candidate models of essentially arbitrary size and 
structure, while [5] is focused on up to n/2 models that are nested (where 
n denotes sample size). Moreover, we give finite-sample results that hold 
uniformly over all data-generating processes under consideration, while the 
main result in [5] is a pointwise large-sample limit result that requires that 
the true regression parameter has infinitely many nonzero components. 

1.3. Outline of the paper. As the data-generating process, we consider a 
Gaussian linear model with random design that is described in Section 2, 
where the number of potentially important explanatory variables can be infi- 
nite. We assume that the error and also the explanatory variables are jointly 
Gaussian, like Breiman and Freedman [5]. Assuming the data to be Gaussian 
allows us to derive explicit finite-sample performance bounds by relatively 
elementary means and to clearly showcase the mechanisms underlying our 
results. Simulation results in [20] strongly suggest that the assumption of 
Gaussianity is not essential, and unpublished preliminary results, which rely 
on random matrix theory, point in the same direction. The unknown param- 
eters in this setting are the sequence of regression coefficients as well as the 
means and the variance/ covariance structure of the explanatory variables 
and of the error term. No additional regularity conditions are imposed on 
the unknown parameters. 

We consider a scenario where the model is selected and fitted to the data 
once and is then used repeatedly for prediction and for predictive inference. 
For performance measures, like the mean-squared error of a predictor or the 
coverage probability or length of a prediction interval, we therefore adopt a 
conditional perspective and treat the training sample as fixed and the future 
response and its corresponding explanatory variables as random.^ 

Given a sample of size n and a collection A4 of candidate models, a pre- 
liminary first goal is to evaluate models m£ A4 based on their performance 
for prediction out-of-sample, and to select a model that performs well for 
this purpose; this is the subject of Section 3. Our second and main goal is to 
conduct inference on future observations based on the selected model like, 
for example, prediction intervals; this goal is studied in Section 4. 



*This deviates from conventional linear model theory, where, usually, the training sam- 
ple is considered random and where, often, the explanatory variables that are used for 
prediction are considered as fixed. Regarding prediction intervals, our approach may be 
compared to the average coverage probability introduced by Wahba [36] and further ana- 
lyzed in [26]. 
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To achieve both goals outhned in the preceding paragraph, we consider 
a model selector and predictive inference procedures post model selection 
that are based on a variant of generalized cross-validation (and that are 
described in detail later). We show that the proposed prediction interval 
is approximately valid and short, except on an event whose probability is 
bounded by 

(1) Ciexp[\og#M-C2in-\M\)], 

uniformly over all data-generating processes under consideration. Here, i^M 
is the number of candidate models, \M\ is the number of explanatory vari- 
ables in the most complex candidate model and Ci and C2 are explicit 
positive constants. The bound in (1) decreases exponentially fast in n — \M.\ 
and increases only linearly in This allows for very large classes of po- 

tentially complex candidate models. If the upper bound in (1) is small, the 
proposed prediction interval is approximately valid and its length is close 
to that of a certain infeasible "prediction interval" that is based on actually 
knowing the "best" candidate model, with high probability (see Proposi- 
tions 4.3 and 4.4 on page 15 for details). Furthermore, we show that the 
following statements hold, except on an event whose probability is bounded 
by (1) (with different values of the constants Ci and C2): (i) The perfor- 
mance of the selected model is close to the performance of the "best" can- 
didate model; (ii) the estimated performance of the selected model is close 
to its actual performance; and (iii) in general, the proposed procedures for 
predictive inference post model selection are approximately valid. In a sim- 
ulation example, we use a training sample of 2000 observations to perform a 
greedy search through a pool of over 10^^ candidate models (see Section 5). 

2. Basic assumptions and quantities of interest. For the data-generating 
process, consider a response y that is related to a collection of explanatory 
variables {xj)j>i by 

00 

(2) y = Yl ^if^j + ^■ 

Assume that the model includes an intercept (i.e., xi = 1) and that the Xj's 
for j > 1 and u are jointly nondegenerate Gaussian with unknown means 
and variance/covariance structure, such that the sum converges in L2.^ No 
additional regularity conditions will be imposed on the data-generating pro- 
cess throughout the paper. Breiman and Freedman [5] consider a special case 



^Hence, the distribution of any finite subset of {xj :j > 1} U {«} is a nondegenerate 
Gaussian with unknown mean- vector and variance/covariance matrix. It is often also as- 
sumed that the Xj^s axe uncorrelated with u and that u has mean zero. This assumption 
is not needed here. In essence, u plays the role of as an unobserved explanatory variable. 
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of (2), where the mean of the explanatory variables is known (and equal to 
zero), and where no intercept is included (i.e., /3i = 0). 

The minimal requirement, that the right-hand side of (2) converges in 
squared mean, restricts the possible values of /? = iPj)j>i in a way that 
depends on the moments of the explanatory variables. For example, if the 
Xj's, j > 1, are independent and identically distributed with mean zero and 
variance, say, one, then (3 can be any sequence of coefficients in ^2- This shows 
that (2) covers a large class of data-generating processes; further examples 
are outlined in Remark 6.1. Of course, (2) also covers parametric models 
with only finitely many explanatory variables (i.e., the case where Pj = 
from some index onward). Moreover, the requirement of nondegeneracy can 
be relaxed as outlined in Remark 6.2. 

Consider a sample of size n from (2). The sample will be denoted by (Y, X) 
with Y denoting the n- vector Y = {y^^\ . . . , y("))' and X denoting the n x oo 
"matrix" or net X = {x^^\ . . . ,x^"'^y , where {y^^\x^''^) are independent and 
identically distributed (i.i.d.) copies of {y,x) as in (2). 

The training sample will be used to fit finite-dimensional submodels of 

(2) that restrict some coefficients of f3 to zero, where the intercept (3i is 
always left unrestricted. Each such submodel is described by a 0-1 sequence 
m = {mj)j>i, where ruj = if the jth coefficient of /3 is restricted to zero 
and rrij = 1, otherwise. The number of unrestricted regression coefficients 
(i.e., J2j>i^j) is denoted by \m\. We assume that \m\ <n — 1 throughout 
the paper. 

Consider a finite collection of candidate models that will be denoted by M 
(which, of course, may depend on sample size n). Assume that each model 
m£ M satisfies 

(3) nil = 1 and \m\ <n — 1 

as before.^ We write |A^| for the number of parameters in the most complex 
model in M; that is, 

\Ai \ = max \m\, 

and we write H^M for the number of candidate models in M . 

For later use, let a'^{m) denote the variance of y conditional on those 
explanatory variables that are included in the model m; that is, 

a'^{m) = Ya.i[y\xj : rrij = l,j > 1] 



In practice, the choice of candidate models M to consider at sample size n is often 
guided by prior knowledge or suspicions about the structure of the underlying parameters. 
For example, if it is assumed or suspected that the coefflcients of /3 are sparse in an 
appropriate sense, one might consider appropriately sparse candidate models a well; such 
a case is discussed in the simulation example in Section 5. Another example is the case 
where the coefflcients of /3 are assumed or suspected to taper off at a certain rate. 
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for each m & A4. Note that this conditional variance does not depend on the 
Xj's because y and ixj)j>i are jointly Gaussian. Also note that < Var[tt] < 
(T^(m) < Var[?/] for each data-generating process as in (2); in particular, 
cr^(m) is always positive. 

The least-squares method will be used to fit models to the training sam- 
ple.^ The restricted least-squares estimator corresponding to a model m G 
is denoted by /3(m) = {Pjim))j>i and is defined as follows: For j satisfying 
rrij = 0, Pj{m) equals zero; the \m\ remaining components of (3{m) are ob- 
tained by regressing Y on the observed values of those regressors that are 
included in the model m (on the probability zero event where the result- 
ing n X |m| regressor matrix is rank deficient, we use the Moore-Penrose 
inverse, say, in the least-squares formula). The usual variance estimator 
based on model m will be denoted by o"^(m) and is given by (T^(m) = 
(n — |m|)~-^ RSS(m) with RSS(m) denoting the residual sum of squares ob- 
tained by fitting model m to the training sample [note that a^(m) > 0, 
almost surely]. 

The performance of a model will be evaluated in terms of the conditional 
mean-squared error of the linear predictor obtained from fitting the model 
to the training sample. Let {y^^\x^^^) be a new copy of as in (2), 

independent of the sample (Y, X). Based on a model m£ M and the sample 
{Y,X), the usual least-squares predictor of y^^^ will be denoted by y^^\m) 
and is given by 

oo 

y^f\m)=Y,x^^^Pjim). 
i=i 

Note that all but |m| coefficients of the restricted least-squares estimator 
P{m) are zero. The conditional mean-squared error of the predictor is now 
defined as 

= E[{y^f\m) - y'^f^f\Y,X]. 

Note that p'^{m) depends on the training sample and, hence, is a random 
variable, and that p'^{m) also depends on the unknown parameters in (2). 
In particular, p^{rn) is unknown. We will also consider the corresponding 
unconditional mean-squared error of the predictor (i.e., E[p^{m)]) and the 
positive square root p{m) of p'^{m). 

If the predictor y^f\m) is to be used for inferences about a new response 
y^^\ the distribution of the prediction error y^f\m) — y^f^ is of particular 



^While it is tempting to also consider penalized least-squares or more general shrinkage 
estimators, particularly for complex candidate models, our current methods cannot handle 
these estimators. 
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interest. Conditional on the training sample, the distribution of the pre- 
diction error y'^^\m) — y^^^ will be denoted by L(m). Clearly, L(m) is a 
Gaussian, and we write for the mean of that distribution and 5{m) for 
its standard deviation. In other words, 

y(-^)(m) -y(-^)|y,X~iV(zy(m),(52(m)) =L(m). 

Note that z/(m) is also the conditional bias of the predictor y^f\m), condi- 
tional on the training sample. As before, also note that the distribution of 
these quantities depends on the unknown parameters in (2), so that v{m) 
and 5'^{m) are unknown. Of course, we have p^{m) = v'^{m) + (5^(m). 

In terms of the conditional mean-squared error of prediction, the best 
candidate model is a minimizer of p^{rn) over M.. We write nip for such 
a minimizer; that is, 

rup = arg min (m) 

(on the event of multiple minimizers, nip is taken as a measurable selection 
from the set of minimizers). In Section 4, we will also consider the candidate 
model for which the conditional distribution of the prediction error [i.e., 
L(m)] is most concentrated. That model [i.e., a measurable minimizer of 
(5^(m) over m G M] is denoted by m^. 

For deriving and analyzing estimators for the quantities of interest 5^ (m) , 
z^^(m) and /9^(m) and for understanding the mechanisms underlying our 
main findings, the following result will be instrumental. 

Proposition 2.1. For each fixed model m € A4, the conditional vari- 
ance of the prediction error y'^^\ni) — y'^^^ given (Y^X) [i.e., 5'^{ni)] has the 
same distribution as a'^{ni) multiplied by the sum of one and the ratio of 
two independent chi-square random variables with \m\ — 1 and n — \m\ + 1 
degrees of freedom, respectively: 

5\ni)^a\ni)(l+ ^^^^ ). 

The conditional bias ofy^^\ni) given {Y,X) has mean zero (i.e., E[v{ni)\ = 0^. 
Moreover, the squared conditional bias v'^{m) has the same distribution as 
5'^{m)/n multiplied by an independent chi-square random variable with one 
degree of freedom: 

u (ni) ~ — a (m) ( 1 + ^ I . 

^ ^ri-|m|+l-^ 

Finally, the usual variance estimator in model m is distributed as a'^{m) ~ 
(T^(m)x^_[^[/(n — |m|) [in case \m\ = 1, the expression xfm]-! P''"^' 
ceding two displays is to be interpreted as constant equal to zero, so that 
(5^(?n) = a'^{m) and v'^{ni) ~ (x^/n)fT^(m) in this case]. 
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Proposition 2.1 extends Theorem 1.3 of Breiman and Preedman [5], which 
describes the distribution of (5^(m) in the case where the regressors in (2) all 
have mean zero and where models do not contain an intercept. For a slightly 
different conditioning sigma-field, the distribution of the corresponding con- 
ditional mean-squared error of the predictor is also derived by Thompson 



Proposition 2.1 shows that the squared conditional bias i/^(m) is of smaller 
order, by a factor of 1/n and in probabihty, than the conditional variance 
(5^(m). A little reflection shows that this is no surprise, for example, in the 
case where the fitted model is correct (i.e., in the case where m contains all 
nonzero coefficients of /?). By Proposition 2.1, the same is true regardless of 
how well the fitted model describes the true one. Of course, i^^(m) can be 
substantial because of either overfit or underfit, say. But, irrespective of that, 
the conditional variance 6'^{m) is the dominating factor in p^im) = v'^{m) + 
5'^{m), in probability. Another feature revealed by Proposition 2.1 is that 
the distributions of i^^ (m) and 5^ (m) depend on the unknown parameters in 
(2) only through iT^(m), and that (T^(m) can be estimated from the training 
sample with good accuracy, provided only that n — \m\ is large. For later use, 
we can also read-off the expected values of 5^(m), i^^(m) and p^{m) from 
Proposition 2.1. Because the mean of equals l/(n — \m\ — 1) for 

n — \m\ — 1 > 0, the mean of (5^(m) equals a'^{m){n — 2)/(n — \m\ — 1) and 
the mean of v'^{m) is n~^a'^[m){n — 2)/{n — \m\ — 1). From this, we also see 
that the mean of p'^{m), that is, the (unconditional) mean-squared error of 
the predictor y^^\m), is given by 



This formula for E[p'^{m)] is also derived in [13] and [32] by different means. 
Finally, Proposition 2.1 suggests that 6'^ {m) / E[5'^ (m)] is close to one pro- 
vided only that n — \m\ is sufficiently large, and that the same is true for 
p'^{m)/E[p'^{m)]. Formalizing this idea and using variations of Chernoff's 
method will lead to the main results of this paper. Theorems 3.1 and 4.1, 
which follow. 

3. Evaluating and selecting models. The performance of model m, as 
measured by the conditional mean-squared error of the predictor y^^\m) 
[i.e., as measured by p^(m)] depends on unknown parameters and, hence, 
cannot be used directly for model selection. We now consider several esti- 
mators for /?^(m). In view of Proposition 2.1 and the ensuing formula for 
E[p^{m)], we see that an unbiased estimator for E[p'^{m)] is given by 



[32]. 
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(see also [13, 32]). Of course, this estimator is also unbiased for p^{rn). The 
estimator fp'{m) is closely related to two well-known model selectors, namely 
generalized cross-validation and the Sp criterion, whose objective functions 
are defined by 



GCV(m)=(T (?n) j — - and Sp{m) = a (m) 



n — \m\ n — i — \m\ 



respectively. For fixed sample size n, choosing a model m that minimizes 
fp'{m) is equivalent to choosing a model that minimizes Sp{m). Moreover, for 
most practical purposes, the difference between p'^^m), Sp{m) and GCV(m) 
will be negligible. Because of technical reasons, we consider another estima- 
tor that is closely related to the three discussed so far. That estimator will 
be denoted by fP'{m) and is given by 

"2/ \ -2/ N 

p [m) = a (m) 



n+1 — \m\ 



Again, note that the difference between fp'{m), p^{m), GCV(m) and Sp{m) 
will be negligible for most practical purposes. The relation between GCV(m) 
or Sp{m) and other well-known model selection criteria in our setting is dis- 
cussed in detail in Section 3.3 of [20]. The next result describes the perfor- 
mance of /5^(m) as an estimator for the conditional mean-squared error of 
the predictor [i.e., as an estimator for />^(m)] in finite samples.® 



Theorem 3.1. Fix a candidate model m£ M. For each e > 0, we have 



(4) 



log 



fp'{m) \ . „ \ n — \m\ 



p'^{m) 



> e < 6 exp 



e + i 



The relation in the preceding display holds uniformly over the set of all data- 
generating processes as in (2). 

Theorem 3.1 shows that the estimated performance of model m is close 
to its true performance, in the sense that the ratio fp' {m) / p^ {m) is close 
to one with high probability, provided only that n — \m\ is large enough, 
independently of the unknown parameters. The theorem places no restric- 
tion on sample size n and on the candidate model m [except for (3) that is 
maintained throughout the paper]. However, the result is most interesting in 
the case where the sample size is relatively small compared to the number 
of parameters in the model, in the sense that \m\/n is not close to zero. 



*In Theorem 3.1, the expression | logp^(m)/p^(m)| is, of course, well defined in case 
fP{m) >0 or, equivalently, in case a^{m) > 0, which is an almost sure event. In case 
p^{m) =0, \ log p^{m) / p^{m)\ is to be interpreted as oo. The same convention is also used, 
mutatis mutandis, in the results that follow. 
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In that case, other model selectors like, say, AIC, AICc, FPE or BIC, give 
a distorted picture of the model's performance, and the model selected by 
one of these model selection criteria can be anything from mildly subopti- 
mal to completely unreasonable, depending on unknown parameters. These 
phenomena are discussed at length in Section 3.3 of [20] for the special case 
where the regressors in (2) are centered to have mean zero and where can- 
didate models do not include an intercept. That discussion also applies to 
the setting that is considered here, mutatis mutandis. 

Because the upper bound in (4) decreases exponentially fast in n — |m|. 
Theorem 3.1 can be used together with Bonferroni's inequality to describe 
the performance of fP'{m) when this estimator is used to evaluate the per- 
formance of several candidate models. For the collection ^A of candidate 
models introduced at the end of Section 2, recall that model nip minimizes 
p'^{m) over m £ A4. The truly best model nip is of course infeasible, but 
Theorem 3.1 suggests that p^{m) can be taken as a proxy for p'^(ni). Define 
the empirically best model m as a (measurable) minimizer of p'^{ni) over 
Al; that is, 

m = arg min p^ (m) . 

For the next result, recall that \A4\ denotes the number of parameters in 
the most complex candidate model and that denotes the total number 
of candidate models. 



Corollary 3.2. For each e > and uniformly over all data- generating 
processes as in (2), we have 



(5) 
and 

(6) P 



(m) 

P log ^. >£ <6exp 



log 



p'^inip) 
p^{rn) 



p'^{ni) 



> e < 6 exp 



log#A^ 



log#A^ 



n-\M\ 



16 e + 16 

n-\M\ £^ 
8 ~ 



The first inequality of Corollary 3.2 relates the performance of the em- 
pirically best model (i.e., m) to that of the actually best candidate model 
(i.e., nip) in terms of the relative performance p'^{m) / p^{nip)] if the upper 
bound in (5) is small, the performance of m is close to that of rup with 
high probability. In that case, one can select a "good" model on the basis of 
the data with high probability. Moreover, the second inequality shows that 
the performance of the selected model can be estimated accurately, in terms 
of the relative error | logp^(m)/p^(m)|, with high probability, provided that 
the upper bound in (6) is small. It should be noted that the upper bounds in 
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(5) and (6) do not depend on unknown parameters but only on sample size, 
on the number of candidate models, and on the number of parameters in 
the most complex candidate model (i.e., on n, and |A^|). In particular, 
these upper bounds are small if the degrees of freedom in the most complex 
candidate model (i.e., n— \M.\) is sufficiently large compared to log^^AI. 
This allows for very large classes of potentially very complex candidate mod- 
els [see also Remark 6.3 for a discussion of the role of the constants 
and \M\ in the upper bounds (5), (6), and in the results that follow]. Fi- 
nally, we note that we actually establish a slightly stronger result during 
the proof of Corollary 3.2, namely that the result continues to hold with 
the left-hand side of (6) replaced by P(max„ig_A^ | log/5^(m)/p^(m)| > e). In 
other words, if the upper bound in (6) is small, then {m) / {m) is close 
to one for each m£ M with high probability. In that case, /O^(-) can be used 
to approximate the predictive performance not only of m but also of other 
model selection procedures that differ from m (see [20] for some examples 
and further discussion). 

The results presented so far are concerned with relative errors like, for 
example, log p'^ (ni) / (m) . Theorem 3.1 also entails similar results for abso- 
lute errors like, for example, pP'{m) — p'^{m), that parallel results in [20] and 
are omitted here for the sake of brevity. 

4. Predictive inference based on the selected model. To use the predic- 
tor y'^f\m) for inferences about the unseen future response y^^\ like pre- 
diction intervals for example, the distribution of the prediction error [i.e., of 
y^f)(^rn) — y*--^^] is an object of particular interest. Recall that we write L(m) 
for the conditional distribution of this prediction error given the training 
sample. For a fixed candidate model m and fixed training sample, L(m), of 
course, depends on unknown parameters and, hence, needs to be estimated; 
in particular, we need to estimate the conditional bias and the conditional 
variance of the predictor. Proposition 2.1 shows that (unconditionally) un- 
biased estimators of 1/(171) and of 5^(m) are given by zero and by 

5^(?n) =a'^{m) — ^ j — -, 

n — 1 — \m\ 

respectively (i.e., E[i'{m)\ = E[8^{'m) — (5^(m)] = 0). This suggests that the 
distribution in question [i.e., L(m) = N{v{rn),5'^{m))\ might be estimated by 
7V(0, 5'^{m)). For technical reasons, we consider a slightly different estimator. 
In particular, we estimate 5^(m) by 

5^[m) = a^{m) j — - 

n + 1 — \m\ 

[which coincides with the estimator fP{m) discussed in Section 3], and we 
estimate the conditional distribution of the prediction error [i.e., L(m)] by 

L{m)=N{0,P{m)). 
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The next result describes the finite-sample performance of ]L(m) as an esti- 
mator for L(m) in terms of the total variation distance. 

Theorem 4.1. Fix a candidate model m G A^. For the conditional dis- 
tribution of the prediction error of the predictor y^^\m), conditional on the 
training sample, and for its estimated version [i.e., /or L(m) and for l^(m) ] 
we have 



n — \m\ 



(7) p(^||]L(m)-L(m)||TV>^ + e) <7exp 

for each e with < e < log(2). The upper hound in the preceding display 
holds uniformly over the set of all data- generating processes as in (2). 

Remark 4.1. Because the total variation distance of two probability 
measures is at most 1, the condition that e is at most log(2) ~ 0.69 main- 
tained by Theorem 4.1 is rather innocuous. Inspection of the proof of The- 
orem 4.1 shows that one can obtain a slightly improved upper bound that 
also holds for all e > 0. The downside of this is that the improved upper 
bound is much more complicated and less revealing. 

By Theorem 4.1, the estimated distribution L(?7t,) is close to the true 
distribution L(m) in total variation with high probability, provided only 
that n — \m\ is large enough, independently of the unknown parameters. 
While the theorem places no restrictions on sample size and on the candidate 
model m [except for (3)], the result is most interesting in the case where the 
candidate model is relatively complex in the sense that \m\/n is not close to 
zero (see the discussion following Theorem 3.1). 

The impact of Theorem 4.1 for inference after model selection is immedi- 
ate in view of Bonferroni's inequality. For the following results, consider the 
collection M of candidate models introduced in Section 2. Recall that #A1 
and \M\ denote the total number of candidate models and the number of 
parameters in the most complex candidate model, respectively; moreover, 
recall that mp denotes the best candidate model and rh denotes the em- 
pirically best candidate model [in the sense that they minimize p'^{m) and 
p^{m), respectively, over m^M]. 

Corollary 4.2. For e satisfying < e < log(2), and uniformly over all 
data- generating processes as in (2), we have 



P^||L(m) -L(m)||TV > + < 7exp 



1 ALK, n-\M\ e- 
log#M ^ 



2 



e + 2 
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During the proof, we actually derive a slightly stronger version of Corol- 
lary 4.2. The result continues to hold if the left-hand side in the preceding 
inequality is replaced by P(maxmex ||lL(?7i) — L(m)||TV > l/y/n + e). This 
can be used, say, to conduct inference based on the model selected by another 
model selection procedure that differs from m. 

For the rest of this section, we illustrate the use of our results to construct 
symmetric prediction intervals centered at y^^\m) that are approximately 
valid and short. Similar results can be obtained for one-sided prediction in- 
tervals or for testing whether, say, the future response lies above (or below) 
a prespecified value. Conditional on the training sample, the prediction error 
y^f\rh) — y^f^ is distributed as L(m) = N{v{m),5'^{'m)). Hence, a "predic- 
tion interval" for y^^^ with conditional coverage probability 1 — a is given 
by [y^f\m) — v{m) — qa5{m) ,y^f\rh) — v{rn) + qa5{rn)\^ and we write this 
"prediction interval" informally as 

y^^\m) — u{m) it qa6{m); 

here, Qa is the 1 — a/2 quantile of the standard normal distribution. Note that 
this construction is infeasible, because it depends on unknown parameters 
through z/(?fi) and 5{m). Corollary 4.2 suggests that a feasible prediction 
interval can be obtained by replacing the true distribution L(m) by the 
approximating distribution ]L(?n) and constructing a prediction interval with 
nominal coverage probability 1 — a using ]L(m) . This amounts to replacing 
z^(m) and 6(771) by zero and by 6{rh), respectively, in the preceding display. 
The resulting prediction interval will be denoted by I[m) and is given by 

(8) X{m) \y^^\m) ^ qa5{m). 

In view of Corollary 4.2, we get the following result. 



Proposition 4.3. Fix e satisfying < e < log(2). Conditional on the 
training sample, the coverage probability of the prediction interval I{m) is 
within 1 / y/n + £ of the nominal level, that is, 

1(1 - a) - P{y^f^ e I{m)\Y,X)\ <^ + e, 



n 



except on an event whose probability is not larger than 

n-\M\ £2 



7exp 



log#M 



2 e + 2\ 

uniformly over all data-generating processes as in (2). 

The (infeasible) valid prediction interval based on the selected model rh 
discussed prior to Proposition 4.3 has width 2qa5{'m)^ and the width of the 
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feasible interval X{m) is 2qa5{rh). From the perspective of interval width, 
the "best" model is m^, that is, the model minimizing 5^(m) over m £ Ai 
(see the discussion at the end of Section 2), and the corresponding exact 
shortest "prediction interval" is 



(9) 



Again, this construction is infeasible because and also v{ms) and 5{m^) 
depend on unknown parameters. The following result compares the feasi- 
ble interval I{m) based on the selected model with the infeasible shortest 
possible interval (9) in terms of width, by comparing (5(m) and 6{ms). 

Proposition 4.4. For each e > and uniformly over all data- generating 
processes as in (2), we have 

n-\M\ e"^ 



P 



log 



5(fh) 



S{ms) 



> e < 4 exp 



log#A^ 



e + 2 



If the upper bound in Proposition 4.4 is small, the length of the feasible 
prediction interval I{rh) is close to the length of the (infeasible) shortest 
prediction interval (9) with high probability. Together with Proposition 4.3, 
this result establishes that the interval I{rh) is approximately valid and 
short with high probability, provided only n — |A^| is large enough compared 
to log#A^ (i-e., provided only that the degrees of freedom in the most 
complex candidate model is large compared to the logarithm of the number 
of candidate models). 



5. Simulation example. We now present an example where we search 
for a "sparse" model in a pool of more than 10^^ candidate models using a 
training sample of 2000 observations. We demonstrate that a good candidate 
model can be identified, that the performance of the selected model can 
be estimated with reasonable accuracy and that a prediction interval post 
model selection obtains an actual coverage probability reasonably close to 
its nominal one. The example is meant for demonstration only and should 
not be mistaken for an exhaustive simulation study (see also [20] for related 
simulations also covering non-Gaussian scenarios). 

Consider a situation where we have available a training sample of 2000 in- 
dependent observations of the response y and of 1000 explanatory variables 
Xj, j = 1,...,1000, from (2), and where we suspect that the correspond- 
ing first 1000 coefficients of (5 are "sparse" in the sense that most of them 
are very small or zero while a few groups of adjacent coefficients are large. 
To come up with a collection of candidate models that can pick out the 
suspected groups of "important" coefficients while not being too large (in 
the sense that log#M <C n — |A^|), we divide the first 1000 coefficients of 
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Fig. 1. The first 1000 coejficients of 13 in the sparse case (left panel) and in the nonsparse 
case (right panel). 

P into 50 blocks of length 20 each, and we consider all candidate models 
that include or exclude a block at a time (plus the intercept that is always 
included). This gives 2^*^ or a little over 10^^ candidate models.^ With an 
exhaustive search over this model space being infeasible, we resort to the 
obvious greedy general-to-specific strategy. We fit the "overall" model con- 
taining all 50 blocks, and eliminate that block whose elimination leads to the 
smallest increase in the residual sum of squares. This results in a model con- 
taining 49 blocks, and now we proceed inductively until all blocks have been 
eliminated and only the intercept remains. This results in a data-driven 
rearrangement of the blocks and, thus, of the whole parameter vector /3. The 
selected model here is the minimizer of p^{-) among the models visited by 
the greedy search and will be denoted by nig throughout this section. 

The suspicion that /? is sparse, which motivated our choice of candidate 
models, may or may not be correct in practice. For the true value of the pa- 
rameter P, we therefore consider two scenarios. In the first, the coefficients 
of /? are indeed sparse, and, in the second, they are not (i.e., a "sparse" and a 
"nonsparse" case). The first 1000 coefficients of (3 in both the sparse and the 
nonsparse case are displayed in Figure 1; the remaining coefficients of f3 are 
set to zero. The first 1000 coefficients of /3 were obtained from realizations 
of ARCH-processes with different parameters for the sparse and for the non- 



^We have also experimented with larger (smaller) block-sizes that lead to correspond- 
ingly smaller (larger) classes of candidate models. Larger block-sizes give better accuracy 
of p^{-) as an estimator for p'^(-) and better coverage properties of prediction intervals 
post model selection; smaller block-sizes have the opposite effect. 

^"The study of alternative and potentially superior strategies of searching through 
model space is beyond the scope of this paper. 
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Fig. 2. Results from one simulation run for the sparse case (left panel) and the nonsparse 
case (right panel). The graphics are described in the main text. 



sparse case. In both cases, the coefficients of (3 were also scaled so that the 
"signal-to-noise" ratio is five in the sense that (Var(y) — Var(n))/ Var(n) = 5. 
If the signal-to-noise ratio is too small, only very parsimonious models per- 
form well; a large signal-to- noise ratio has the opposite effect. We chose a 
signal-to-noise ratio between these two extremes. The remaining parameters 
in (2) were chosen as follows. We chose u independent of the Xj's with mean 
and variance 1, the variance/covariance structure of the explanatory vari- 
ables was chosen so that Cov{xj,Xk) = 2~l-^~'''l for j,k € {2, . . . , 1000} (recall 
that xi = 1 is the intercept), and we took independent realizations of a 
standard normal for the means of the x/s scaled so J2f=2 E{xj)Pj = ^2.'^^ 
For one set of training data [i.e., for 2000 observations from (2) with the 
parameters just described] the results in both the sparse and the nonsparse 
case are visualized in Figure 2. The data-driven rearrangement of (3 obtained 
by the greedy search is shown at the bottom of each panel next to the axis 
labeled Beta. The block of 20 coefficients to the far right was eliminated 
first, the block next to it was eliminated next, et cetera, until only the inter- 
cept remained. Note that this corresponds to a data-driven sequence of 51 
nested models of increasing complexity, from the model containing only the 
intercept up to the overall model containing all 1000 explanatory variables; 
the horizontal axis can be thought of as indexing these 51 nested models. 



^^In additional experiments, with several other choices for /3, we obtained results con- 
sistent with those presented here. 

^^Our choice for the variance/covariance structure of the Xj's is ad-hoc. Repeating the 
simulations with Cov{xj,Xk) = r'-*"*^' for other values of r between and 0.9, we obtained 
basically identical results. The same applies, mutatis mutandis, to the choice of the E{xj)'s 
and to the scaling of the means. 
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The performance of each of the 51 models obtained by the greedy search is 
shown by the graph in the middle of each panel, next to the axis labeled 
Mean-Squared Error. The black line shows the value of fP'{-) for each of the 
models (estimated performance), while the gray line shows the true value 
of /O^(-) (actual performance). For better readability, points are joined by 
lines. The selected model is indicated by a vertical dashed line. Note that 
the conditional mean-squared error of any predictor is bounded from be- 
low by Var(u), which equals 1 here. Therefore, models m with p'^{m) close 
to 1 perform well. Finally, for each of the 51 models obtained through the 
greedy search, the coverage probability of a prediction interval based on that 
model is shown at the top of each panel, next to the axis labeled Coverage 
Probability. For each such model m, we computed the prediction interval 
I{rn) introduced in (8) with 1 — a = 0.95. The black line shows the true con- 
ditional coverage probability of these intervals, conditional on the training 
sample. Again, points are joined by lines. The gray horizontal line at 0.95 is 
for reference. Note that models with small /5^(m) also correspond to short 
prediction intervals, because the width of T(m) is governed by 5{m) = p{m). 
The conditional coverage probability of the prediction interval based on the 
selected model is indicated by the vertical dashed line. Because coverage 
probabilities are computed conditional on the training sample, they can be 
both above and below the nominal value of 0.95. Because the 51 models 
shown in each panel of Figure 2 were obtained through a greedy search 
through model space, fP{-) tends to under-estimate /O^(-) for these models, 
resulting in prediction intervals that tend to be too short and whose condi- 
tional coverage probabilities tend to fall below 0.95. 

In the sparse case (left panel), the chosen class of candidate models is 
satisfactory in the sense that it contains a relatively parsimonious candidate 
model that performs well. The selected model rhg contains 100 explanatory 
variables [and coincides with the model minimizing the actual performance 
p^(-) among the 51 candidates identified by the greedy search]. The selected 
model's estimated performance of fP'{mg) = 1.110 is close to its actual per- 
formance of p'^{rhg) = 1.124 (which in turn is close to the lower bound 1). 
The conditional coverage probability of the prediction interval based on the 
selected model is 0.948. In the nonsparse case (right panel), the class of can- 
didate models is unsatisfactory in the sense that it does not contain a simple 
model that performs well. Among the 51 models identified by the greedy 
search, the model with 940 explanatory variables performs best [minimizer 
of /5^(-)]) while the model rhg selected by minimizing p^{-) contains 900 co- 
efficients. The actual performance of the selected model is p'^{rhg) = 1.929, 
while its estimated performance is fP'{mg) = 1.924. The selected model im- 
proves little over the overall model containing all 1000 explanatory variables, 
in terms of actual performance as well as in terms of estimated performance. 
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The conditional coverage probability of the prediction interval based on the 
selected model is 0.949. Overall, the minimum of the coverage probabili- 
ties over the 51 models found by the greedy search (i.e., the minimum of 
the curve next to the axis labeled Coverage Probability) is 0.935 in the 
sparse case and 0.938 in the nonsparse case, respectively. 

The experiment whose results are shown in Figure 2 was repeated a total 
of 100 times. The results of these repetitions are so similar to those shown in 
Figure 2 that we do not present them here in detail. Over the 100 repetitions 
and for the conditional coverage probability corresponding to the selected 
model, we obtained a median of 0.949 and a minimum of 0.936 in the sparse 
case, and a median of 0.942 and a minimum of 0.924 in the nonsparse case. 
Also, over 100 repetitions and for the minimum of the coverage probabilities 
corresponding to the 51 models identified by the greedy search, we obtained 
a median of 0.931 and a minimum of 0.919 in the sparse case and a median 
of 0.934 and a minimum of also 0.918 in the nonsparse case. 

6. Remarks and extensions. 

Remark 6.1 [Examples of data- generating processes as in (2)]. The dis- 
tribution of the random variables in (2) is, of course, characterized by their 
first and second moments. Assume, for simplicity, that the Xj's are uncorre- 
lated with u and that u has mean zero. Write /? for the sequence of regression 
coefficients (3 = (/9j)j>i, write o"^ for the variance of u and denote the se- 
quence of means and the variance/covariance net of the Xj+i's, j > 1, by 
7 = (7j)j>i and S = {'^j,k)j°k=i (recall that xi denotes the intercept, i.e., 
xi = 1). That is, the mean and the variance of xj^i are 7j and Sjj, re- 
spectively, and the covariance of Xj+i and x^+i is Sj 1 < j < k. Then, the 
(joint) distribution of y, xj , j >l and u in (2) is characterized by (/3, 7, S, a) . 
Write H for the collection of all quadruples ^ = (/?,7, E,(t) such that the se- 
ries in (2) converges in L2, and such that the joint distribution of the xj's 
for j > 1 and of u is nondegenerate. The following examples illustrate that 
H is quite rich and includes subsets that are noncompact (with respect to 
the appropriate canonical topology): 

(i) Assume that the Xj's for j > 1 are uncorrelated with common vari- 
ance equal to unity, and write / for the corresponding variance/covariance 
net (i.e., Ij^k equals one if j = /c and zero otherwise). Then, S contains all 
quadruples ^ of the form ^ = (/?, 7, /, a) satisfying (3 £ I2, 7 G ^2, and o" > 0. 

(ii) Let <; = (?j)j>i be an arbitrary sequence of positive numbers. Assume 
that the uncorrelated as in (i) before but now with Vaic{xj) = c^J, 
and write diag(<;"^) for the corresponding variance/covariance net. Then, H 
contains all quadruples ^ of the form ^ = (/?, 7, diag(<;^), cj) for which S h 
and G I2 (where the products are understood component-wise) and cr > 0. 
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(iii) Fix p satisfying 1 < p < oo, and let q be such that, either 1 < p < co 
and 1/p + 1/g = 1, or p = 1 and q = oo, or p = oo and (? = 1. Let S dp ^ 
Iq be a continuous Hnear operator satisfying {a,S(3) = {Sa,f3) for each a 
and P in Ip, and satisfying (a, Sa) > whenever a (z Ip is nonzero. Here, 
(•,•) denotes the usual product of sequences (i.e., the sum of component- 
wise products). The operator S defines a variance/covariance net S(S') by 
T,{S)j^k = i^j^Sek), where e/ denotes a sequence with a 1 in the Ith position 
and zeroes otherwise {I > 1). Then, H contains all quadruples ^ of the form 
^ = 7, S(S'), cj) satisfying /? G Zp, 7 G Ig, S as before, and cr > 0. 

Remark 6.2 (Reduced-rank models). We have required that the joint 
distribution of the Xj's for j > 1 in (2) is nondegenerate (and Gaussian). 
For our purpose, this guarantees that the n x \m\ matrix of those regres- 
sors in the training sample that are included in a model m S is non- 
degenerate with probability one. We now discuss the case where this re- 
quirement is not met. Assume, for a candidate model m £ M, that some 
of the explanatory variables xj that are included in the model m are per- 
fectly correlated with each other. In that case, there is a submodel m' of 
m (i.e., a model m' satisfying m'j < rrij for each j), such that the explana- 
tory variables included in model m' are not perfectly correlated with each 
other, and such that the least-squares predictors based on model m and m' 
coincide [i.e., y^f\in) = y^f\m')\, almost surely. Here, the restricted least- 
squares estimator (3{m) needs to be computed using a generalized inverse 
in the least-squares formula, because the sample regressor matrix corre- 
sponding to model m is of reduced rank, almost surely. Hence, we also have 
p^{m) = p^im') and L(m) = L(m') a.s. Now, repeat this replacement process 
for each candidate model in M (i.e., replace each reduced-rank model by an 
appropriate full-rank submodel and leave the full-rank models unchanged). 
This results in a new collection of candidate models, which we denote by 
M' . Inspection of the proofs reveals that all the results in Sections 2, 3 and 4 
continue to hold with M' replacing M. 

Remark 6.3 {Note on constants). Several performance bounds that are 
reported in this paper depend on the constants #A1 and \M\ (see Corollar- 
ies 3.2 and 4.2, as well as Propositions 4.3 and 4.4). These bounds are con- 
servative because they hold uniformly over a large class of data-generating 
processes and for each class M of candidate models that satisfies (3). In 
particular, the results also cover the case where all candidate models are 
equally complex and where all fit equally well. In view of this, it is not 
surprising to find the constants and |A^| in the upper bounds. If addi- 
tional regularity conditions are imposed on the regression parameter, and if 
the family of candidate models M is chosen in accordance to these regular- 
ity conditions (e.g., sparse candidate models in case a sparsity condition is 
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imposed on the true regression parameter), it is likely that the upper bounds 
can be improved. Also, the fact that the upper bounds all increase linearly 
with the number of candidate models (i.e., with originates in the use 

of Bonferroni's inequality, which could leave room for improvement. These 
issues, however, are beyond the scope of this paper. 

Remark 6.4 (Asymptotic rates). The results in Sections 3 and 4 al- 
low us to read off the rates at which quantities like logp^(m)/p^(m) or 
log p'^{rh)/p'^ {nip), for example, converge to zero, in probability, in appro- 
priate asymptotic settings. Under rather weak conditions, we show that the 
typical rate is l/i/ra in the following: 

(i) Consider a sequence of sample sizes n, and a corresponding sequence 
of candidate models m^") (that may depend on n), such that \m^^^\/n < r for 

fixed r, < r < 1, and for each n. As always, we also assume that rn'^^ = 1 
and that |m(")| <n — 1. Denoting the distribution of the sample of size n 
by Pn{')-, Theorem 3.1 entails that 



p [m 

log 



p2(m(")) 



> t < 6 exp 



1 -r t 



2 n 



t + s 



for each n, for each t > 0, and uniformly over all data-generating processes 
as in (2). In other words, log/5^(m^"'^)/p^(m^"'^) is of order \/^/n in prob- 
ability, uniformly over all data-generating processes as in (2). In a similar 
fashion, ||]L(m^"'^) — L(m("))||TV is uniformly of order l/y/n in probability 
(see Theorem 4.1). 

(ii) Now, consider a sequence of sample sizes n and a corresponding 
sequence of families of candidate models A^(") [such that (3) holds for each 
m E A^(") and for each n\. Moreover, assume that |A^("^| < r for fixed r, 
< r < 1, and for each n, and that log^^A^^") = o(n). We stress that now 
quantities like the "best" model rrip, the empirically best model m, the 
conditional distribution of the prediction error L(m), its estimated version 
L(m), the prediction interval X(m), et cetera, all depend on n, although this 
dependence is not shown explicitly by the notation. Under these assumptions 
we obtain that the following quantities are each of order 1 / -y/n in probability, 
uniformly over all data-generating processes as in (2): log p^ {m) / p^{rnp) and 
log fp {rh) / p"^ [rh) (see Corollary 3.2); ||]L(m) — L(m)||TV (see Corollary 4.2); 
{1 - a) - Pniy^-^^ eI{mn)\Y,X) (see Proposition 4.3); and log6'^{m)/5'^{ms) 
(see Proposition 4.4). 



APPENDIX A: PROOFS FOR SECTION 2 



The following two lemmas will be instrumental in the proof of Propo- 
sition 2.1. We suspect that these two results, which basically rely on the 
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rotational invariance of the normal distribution, are well known, in some 
form or another, but we could not find a convenient reference in the litera- 
ture. Throughout, the Euclidean norm of a vector u € M'^ is denoted by \\v\\. 



Lemma A.l. For k>l, let N{0,lk) and fix qq G with \\ao\\ = 1. 
Then, there exists a kx k matrix R whose elements are measurable functions 
of a, such that R'R = 1^ and a = ||a||i?ao almost surely. 

Proof. Write Cj for the jth Euclidean basis vector of M'^. It suffices 
to consider the case where = ei, because ao can be written as oq = Sei 
where S is a fixed orthonormal kxk matrix. For oq = ei, consider the event 
E where ||a|| > and where a is linearly independent of 62, . . . , e^. Clearly, 
E is an almost sure event. On E, compute an orthonormal basis ri, . . . ,rfc 
of R^, by setting ri =a/||a|| and by then applying the Gram-Schmidt or- 
thonormalization procedure to n, 62, . . . , e^, and set i? = (n, . . . , r^). On E^, 
set R = Ik, say. The matrix R has the desired properties. □ 

Lemma A. 2. Let M be akxl matrix with i.i.d. standard normal entries, 
and let oq G M'' with ||ao|| = 1 (l<l<k). For Pm = M{M'M)-^M', the 
distribution of a^PujaQ is given by 

a'oPMao - 



Xi + xl- 



where xf ^.''^d xl-i denote two independent chi-square random variables with 
the indicated degrees of freedom. In case I = k, xl-i ^-^ interpreted as 

constant equal to zero and the distribution on the right-hand side of the 
preceding display is to be interpreted as point mass at one. 

Proof. As the case I = k is trivial, we may assume that / < k. Let 
a N{0,lk) independent of M. Using Lemma A.l, we can rewrite a as 
a = ||a||i?ao almost surely. With probability one, we thus have oq = i?'a/||a|| 
and agP/\,/ao can be written as 

, ^ a'RPMR'a a'RM{M'R'RM)-^M'R'a a'Pu^a 

aoPAiao = = 

a'a aa a'a 

almost surely, where, for the last equality, we use Mo as shorthand for RM 
and define Pjv/o like Pm with Mq replacing M. Conditional on o, the columns 
of Mo are i.i.d. N(0,lk) [because the columns of M are i.i.d. N{0,lk) inde- 
pendent of a, and because RR' = Ik]. As that conditional distribution does 
not depend on the conditioning variable, we see that the entries of Mq are 
i.i.d. standard Gaussians independent of a. Hence, the columns of Mo are 
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linearly independent with probability one, and the expression on the far 
right-hand side of the preceding display is distributed as xf/(xF + Xfc-z)- 
□ 

Before turning to the proof of Proposition 2.1, the following preparatory 
consideration and the attending lemma are also required. Throughout the 
following, fix a candidate model m € M. Recall the linear model (2), and 
write z for the |m| -vector of those explanatory variables xj that are included 
in the model m (in their natural order, so that zi corresponds to the inter- 
cept, i.e., zi = xi = 1). Because y and z are jointly Gaussian, the conditional 
distribution of y given z is again a Gaussian. Because the model m includes 
an intercept (i.e., zi = 1) the conditional mean of y given z is a linear func- 
tion of z. Recalling that the conditional variance of y given z is a'^{m), we 
see that y\z ~ N{z'6,a'^[m)) for an appropriate |m|-vector 9. In other words, 
(2) can be rewritten as 

(10) y = z'e + v 

with V ~ N{0,a'^{m)) independent of z. The vector z of those explanatory 
variables that are included in model m is also Gaussian, and, in the following, 
we write r] and F for the mean- vector and for the variance/covariance matrix 
of its distribution, respectively: 

(11) zr^N{r],r). 

Clearly, r] is an |m|-vector and F is an \m\ x \m\ matrix. Because the first 
regressor corresponds to the intercept, we have 771 = 1 and Fi^i = 0. In case 
\m\ > 1, the submatrix of F corresponding to Z2, ■ ■ ■ , z^m\ is positive definite 
by assumption [see the discussion following (2)]. 

Lemma A. 3. For fixed m G M, let r] and F be as in (11) and set 
A = T + r]r]' . Then, A is positive definite, and so is its symmetric square root 
A"*^/^. Moreover, the matrix A~"^/^FA~"^/^ admits a spectral representation 
^-i/2p^-i/2 ^ such that A = diag(0, 1, . . . , 1) (i.e., the first eigen- 

value equals zero and all the others equal one), such that W = {w^^\ . . . ,i«(l'"l)) 
with the w^^\ j = 1, . . . , |m|, being orthonormal eigenvectors, and such that 
^{i) ^ A-^/2j^_ particular, VF'A"^/^^ =(1,0,..., 0)' G IRI""I . 

Proof. To show that A > 0, assume that w € R'™"' is such that w'Aw = 
0. Partition w as w = {wijw'^i)' (i.e., into its first component wi and the 
vector w^i of its \m\ — 1 remaining components), partition r] conformably 
as r] = {r]i,ri'_^^y , and let S denote the lower diagonal (|m| — 1) x (|m| — 1) 
submatrix of F. Recall that r/i = 1, that S > 0, and that the first row as 
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well as the first column of F contain zeroes only [see (11) and the attending 
discussion]. Therefore, 

w'Aw = w'Tw + w'rjrj'w = w'^iTjW^i + w'rjrj'w. 

Because w' /^w = and S > 0, we see that w^i = 0. Hence, w' /^w = wItjI = 
Wi, so that wi also equals zero and w = 0. 

Write K as shorthand for A~^/^rA~^/^, and note that K has rank \m\ — 1, 
because T has rank \m\ — 1. Moreover, we have 

K = A-i/2rA"^/2 = A-i/2(A - r?r/')A-i/2 = - A-'/^r]r]' A~^/\ 

Set w^^^ = A~^/^?7, and note that w^^^ is nonzero because r]i = 1. For each 
vector w in the orthogonal complement of w^^\ we thus have Kw = w. 
Hence, |m| — 1 eigenvalues of K equal one, and the corresponding eigen- 
vectors (which can be chosen as normalized and mutually orthogonal) are 
orthogonal to w^^\ The remaining eigenvalue of K must be zero, and w^^^ 
must be a corresponding eigenvector. This entails that = Kw^^^ = w^^^ — 
whence ||'u;(^)|| = 1. Finally, W'A~^/'^ri = W'w^^'^ = (1,0, . . . ,0)'. 

□ 

Proof of Proposition 2.1. Without loss of generality, we may as- 
sume that the random matrices in the following arguments are invertible 
whenever we need them to be, because the event where that is not the case 
has probability zero. For the given model m, write Z for the n x |m| matrix 
of those explanatory variables in the training sample that are included in 
the model m, such that the ith entry of Y and the ith column of Z' are 
independent copies of y and z in (10) for i = 1, . . . ,n. Note that the ith col- 
umn of Z' is distributed as in (11). Let A^/^ W and A be as in Lemma A. 3, 
and set Z^') = ZA-'^/'^W. Lemma A.3 now entails that the ith column of 
is distributed as A^(ei,A), where ei = (1, 0, . . . , 0)' G RI"'! and A is the 
diagonal matrix A = diag(0, 1, . . . , 1). In particular, Z^*) can be partitioned 
as Z(') = (i,Z(°)), where L is an n-vector of ones and Z^°') is an n x (|m| — 1) 
matrix with i.i.d. standard normal entries. 

For e as in (10), setV = Y- Z9, and note that V ~ iV(0, a^{m)In) inde- 
pendent of Z. From this, it follows that a'^{m) ~ o"^(r?i)x^_|^|/(n. — \m\) as 
claimed. Moreover, z^(m) and 6'^{m) can be written as 

u{m) = r]'{Z'Z)-^Z'V and 6^{m) = V' Z{Z' Z)-^T{Z' Zy^ Z'V + a'^{m) 

[compare the definitions of v{m) and (5^(m) in Section 2, as well as (10) 
and (11)]. From the first equation in the preceding display, we also see that 
E[v{iTi)\ = 0. It remains to derive the distribution of v'^{m) and of 5^(m). 
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To this end, we need more convenient representations of these quantities. 
Rewrite 1/(171) as 

= r]'A-'/^WiW'A-'/'-Z'ZA-'/^Wy'W'A~'/^Z'V 

=e[{z^-yz('Y'z^-yv, 

where the last equality follows upon observing that we have set Z^*^ = 
ZA"^/'^W and that W'A~^/'^ri = ei by Lemma A. 3. A similar argument 
gives 

6\7n) - a\m) = V Z^'\z^*y Z'^'^Y^W A'^I'^T A'^I'^W {Z^'^ Z^'^^Y^ Z^'^ V 

= y'z(')(z(')'z('))-^A(z(')'zW)-^z(')V, 

where we use the spectral representation of A^^/2p/\-i/2 given in Lemma A. 3 
to get the last equality. We thus see that v^{m) is the square of the first 
component of the |m|-vector (Z^'^ Z^'^Y^ ^5 5^(m) — a^{rri) is the 
sum of squares of the remaining |m| — 1 components of that vector. 
Partitioning Z(') as Z^*^ = as before, we see that 

{i'{In-Pzio,)iY^i'{In-Pzio))V \ 

- - P,)V I ' 



(12) (Z(')'Z('))"^Z(*)V 



where P^ and Pz(o) denote the orthogonal projections on the space spanned 
by i and on the column space of Z^°\ respectively. Relation (12) follows 
either by using the inversion formula for partitioned matrices on the corre- 
sponding partition of Z^'^ Z^'^ and simplifying, or from geometric properties 
of orthogonal projections. 

For the distribution of v'^{m), recall that v'^{m) is the square of the first 
component of the vector on the right-hand side of (12). In particular, v'^{ni) 
can be written as 



VP I _p , 



t'(/„ -P^(o))t 

The numerator in the preceding display is distributed as (T^(m)xi, indepen- 
dent of Z^°^ . The denominator is a function of Z'^°^ and, hence, independent 
of the numerator. Using the Lemma A. 2 with i/y/n and Z*-") replacing oq 
and M we see, in the notation used in that lemma, that l'P^(o)L has the 
same distribution as nxf^^_^/{x'\jn\~i + ^l~\m\+i)- Hence, 

Ut id \ '^n-\m\+l 
i {In - Pz(o))l ~ n-2 —2 . 

^\m\-l ' ^n-\m\+l 
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This entails that ~ {xi/n)a'^{m){l + xfm|_i/Xn_|m|+i) ^ claimed. 

For the distribution of J^(m), write M as shorthand for (/„ — P^)Z^°\ We 
see from (12) that 

6^{m) - cj2(m) = V'MiM'My^M'V = w'{M'M)-^w, 

where, for the last equality, we use w to denote the (|m| — l)-vector w = 
{M'M)-^I'^M'V. Since V ~ iV(0, cj2(m)/„,), it follows that w ~ iV(0, a'^{m) x 
-^|m|-i)i independent of M. Using Lemma A.l with w and ei replacing a and 
ao; we obtain an orthonormal matrix R such that w = \\w\\Rei almost surely. 
It follows that 5'^{m) — a^{m) can be written as 

\\wfe[{R'MM'R)-^ei = \\wfe[{R'Z^°y{In - P,)Z^°^ R^^ei. 

Write Z(^) as shorthand for Since R'R = I\rn\-i, we see that Z^^) = 

is an n X (|m| — 1) matrix with i.i.d. Gaussian entries, and that Z^^^ 

is independent of w. Partition Z^^^ as Z^-^^ = {z'^\z^), where z'^^ is 
the first column of Z^^^ , and apply the partitioned inversion formula to the 
corresponding partition of (Z^^^ — PJZ^^^). This gives 

almost surely. In the preceding expression, the numerator is distributed as 
cr^(m)x^^l_^ and is independent of the denominator. For the denominator, 

note that Zj is an n- vector of i.i.d. standard Gaussians, and (/„ — Pi){In — 
P, r,\7(R)){In — Pl) is the matrix of an orthogonal projection of rank n — 

|m| + 1 (except on a probability zero event, as is easy to see). It follows 
that 5^(m) — (T^(m) is distributed as o"^(m)x^„|_;^/x^_|m|+i and 5^(m) is 
distributed as (T^(m)(l + xfm|_i/x^_|m|+i) as required. □ 



APPENDIX B: AUXILIARY LEMMAS FOR SECTIONS 3 AND 4 

In this section, we show, in essence, that 5^(m), fp'{m) and p^im) each are 
close to the same value with high probability, provided that n — |m| is large 
enough (see Lemmas B.3, B.4 and B.5, resp.). To derive these results, we 
also need the two elementary lemmas that follow: Lemma B.l gives bounds 
on certain probabilities involving a Xi random variable, and Lemma B.2 
gives a collection of inequalities that will be used later. 

Lemma B.l. LetF{-) denote the cumulative distribution function (c.d.f.) 
of the Xi distribution. Then 

^/ log(t)\ ^/log(t)\ ^ log(t) 
V t-lj \t-lj 
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holds for each t>l. Moreover, we have 



l-F{t)< 



exp 



t + log(t) 



for each t> 0. 



Proof. For the first inequality, write g{t) as sliorthand for the left- 
hand side, and write h{t) for the right-hand side. We need to show that 
g{t) < h{t). Since limt~*i git) = lim^^i = 0, this will follow if we can 
show that g'{t) < h'{t). First, note that g'{t) is given by 



F'it 



\og{t) 



t-l 



1 



log(t) 

t-l {t-iy 



F' 



, nos{t) 
t-l 



1 



log(t) 



t{t-l) 



(t-iy 



F' 



t-l 



F'it 



. log(^) 
t-l 




-(2t-l)/(2t-2) 



where the two equalities in the preceding display follow by plugging-in the 
formula F'{t) = t~^/^exp(— t/2)/-v/27r and simplifying. We need to show that 
g'{t)/h'{t) < 1; that is, 



^log(t)^_(l/2)(iog(t)/(t_l)_l) ^ ^ 

[the left-hand side of the preceding inequality equals g'{t)/h'(t), which is eas- 
ily seen by using the formula for g'{t) obtained before and h'{t) = l/(t\/2vre)]. 
For s satisfying < s < 1 set u{s) = ^/sexp{—{s — l)/2), and set v{t) = 
log(t)/(t — 1) for t as before. For each t > 1, we have < v{t) < 1, so that 
u{v{t)) is well defined. Clearly, the left-hand side of the inequality in the pre- 
ceding display can be written as u{v{t)), and we need to show that u{v{t)) < 
1. Since limj^i v{t) = 1 (as is easy to see), we get limf^i u(v{t)) = 1. It hence 
suffices to show that u{v(t)) is decreasing or, equivalently, du{v{t))/dt = 
u' {v{t))v' {t) < for t > 1. It is now elementary to verify that v'{t) < for 
t > 1 and that u'{s) > for s satisfying < s < 1. Hence, u'{v{t))v'{t) < 
and u{v{t)) is decreasing. 

For the second inequality, write $(•) and respectively, for the c.d.f. 
and for the Lebesgue density of the standard normal distribution. The result 
follows upon observing that 1 — F{t) = 2(1 — <I>(\/t)) and that 



2(l-^>(Vt)) <2 



HVt) 
Vt 



2 

— exp 
vr 



t + log(t) 



where the inequality holds because of the well-known argument that 1 
$(t) = /^°° ^{u) du < J^°^{1 + 1/m2)^(u) du = (l){t)/t for t > 0. □ 
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Lemma B.2. (i) For s satisfying < s < 1 and for t>0, we have 

, e* + s - 1 , , 
t — s log > ( 1 — s 



t + l + s 



(ii) For s and t satisfying < s < 1 and <t < — log(l — s), we have 

-t-slog(e"* + s- 1) > t - slog(e* + s - 1). 

(iii) For t>0, we have 



^2 

e* - 1 - t > e~* - 1 + t > 



t + 2 

Proof. For part (i), set f{t) = t - slog((e* + s- l)/s) and g{t) = (1 - 
s)t^/(t + 1 + s). To show that f {t) > g(t), first note that /(O) and g{0) are 
both equal to zero. It thus suffices to show that f'{t) > g'{t) for each t > 0. 
It is easy to see that 

f'{t) = {l-s) ^ and g>{t) = {l-s) \ \ 

Plugging these formulae into the relation f'{t) > g'{t) and simplifying, we 
see that the relation is equivalent to 

(e* - 1)(1 + s)^ - st"^ - 2s{l + s)t > 0. 

Replacing e* — 1 by t + in the preceding expression, we obtain a lower 
bound for the left-hand side. After trivial simplifications, that bound reduces 
to t(l — s^) + + s^)/2 which is nonnegative because t > and s < 1. 

For part (ii), let f{t) = t - slog(e* + s - 1) and h{t) = f{-t) - f{t). We 
need to show that h{t) > 0. Since /i(0) = 0, it remains to show that h'{t) > 0. 
Now, h{t) = -2t-s log(e-* + s - 1) + s log(e* + s - 1) and 



h'{t) = -2 + —^-—- + 



+ S-1 e* + s- l 

Note that, by choice of f < — log(l — s) and t>0, both denominators in the 
two fractions in the preceding display are positive. Multiplying the expres- 
sions in the preceding display by (e~* -|-s — l)(e*-|-s — 1) >0 and simplifying, 
we see that h'{t) > if 

(l-s)(2-s)(e*- l)(l-e-*) >0. 

This inequality is, of course, always satisfied because s < 1 and t>0. 

For part (iii), ffist expand e* -e"* as E^o(*^ - = 2Ej°lo*^^^V 

{2j + 1)!. Hence, e* — e~* > 2t, which is equivalent to the first inequality 
in (iii). For the second inequality, set f{t) = e~* — 1 -|- t and g{t) = t^ /{t + 
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2). Since /(O) = g{0) = 0, it suffices to show that f'{t) > g'{t), and that 
inequahty is easily seen to be equivalent to 

4-e"*(t + 2)2 > 0. 

Write h{t) for the left-hand side of the preceding inequality. Observing that 
h{0) = and that h'{t) = e~H{t + 2) > completes the proof. □ 

We are now ready to give the three results that state that 6^{m), p^{'m) 
and p'^{m) each are close to the same fixed value with high probability, pro- 
vided that n— \m\ is large enough. That value is taken as na'^{m)/{n — \m\ + 
1), which is close but not equal to £'[(5^(m)] or £'[/?^(m)] (see the discussion 
and formula for Elp^im)] given at the end of Section 2). Throughout, let m 
be a fixed candidate model from ^A. 



Lemma B.3. For each t > 0, we have 
, n — \m\ + 1 



P 5 



(m) 



na^{m) 



> exp(t) < exp 



n ■ 



\m\ + 1 



t + l + (|m| - l)/n 



and P{6'^{m){n — \m\ + I) / {na'^ (m)) < exp(— t)) is also bounded by the ex- 
pression on the right-hand side of the preceding display. Clearly, that expres- 
sion is not larger than exp[— ((n— \m\)/2)t'^ /(t + 2)]. 

Proof. The case |m| = 1 is trivial, as then (5^(m) = a'^{m) by Propo- 
sition 2.1, and the probabilities of interest reduce to P(l > exp(t)) and 
P{1 < exp(—t)), which are both equal to zero. Assume, henceforth, that 
\m\ > 1. Let A and B be independent and distributed as vl ~ xfmi-i 



tion 2.1, and 



. Then, 5'^{m) is distributed as a'^{m){\ -\- A/ B) by Proposi- 



a'^{m){l + A/B) 



n ■ 



\m\ -\- 1 



\m\ 



1 fA{n 



\m\ 



n 



V B{\m\-1) 



1+1 



n(T^(m) 

(as is elementary to verify). 

First, consider P{6'^{m){n — \m\ + l)/{na'^{m)) > exp(t)). In view of the 
consideration in the preceding paragraph, this probability equals 

' A{n — \m\ + 1) 



B(\m\ - I) 



1 > 



n 



\m\ 



1 



r 



Using Lemma A.l of [20] [with |m| — 1, n — \m\ -\- 1 and (exp(t) — l)re/(n — 
|m| + 1) replacing a, b and e, resp.], the probability in the preceding display 
is not larger than 



exp 



n 



\m\ -\- 1 



/C 



m 



n ■ 



1 

\m\ + 1 



n 



n 



\m\ + 1 
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where the function /C(r, c) is defined for r > and c > — r by /C(r, c) = (1 + 
r) log((l + r + c)/(l + r)) — r log((r + c)/r). We need to show that the factor 
involving the /C-function in the preceding display satisfies 

IC( 1":'-^ , (e* - 1) ) > /, ■ 

\n— |m|+l n — |m| + 1/ t + 1 + (|m| — l)/n 

To this end, write s as shorthand for (|m| — l)/n and note that we always 
have < s < 1. With this, the relation in the preceding display is equivalent 
to 

1 / e* + s- l\ ^ 
t — s log > 



1-sV s J ~ t + l + s 

(expand the formula for the /C-function and simplify). It now follows from 
part (i) of Lemma B.2 that the relation in the preceding display holds. 

Now, consider P{5'^{m){n — \m\ + l)/(no"^(m)) < exp(— i)), or, equiva- 
lently, 

^ ^ V -B(|m|-1) |m|-l^ 

In case (e~* — l)ri/(|m| — 1) < —1, this probability is zero and hence trivially 
bounded as claimed. In the case where (e~* — l)n/ (|m| — 1) > —1, or, equiv- 
alently, t < — log(l — s), we argue as in the preceding paragraph, mutatis 
mutandis, to see that (13) is bounded as claimed if 

1 / , e-' + s-l \ ^ 
-t — s log > 



l-s\ s J ~ t+l + s 

But this relation follows by first applying part (ii) and then part (i) of 
Lemma B.2 as before. □ 



Lemma B.4. For each t>0, we have 

P[P ("i) 2f \ > exp(t) < exp 



n — \m\ t^ 
2 1+2 



and P{fP'{m){n — \m\ + l)/(n(T^(m)) < exp(— t)) is also hounded by the ex- 
pression on the right-hand side of the preceding display. The result continues 
to hold with 5'^{m) replacing fp{m). 

Proof. For B ~ Xn-\m.\^ have o"^(m) ~ a'^{m)B / {n— \m\) (see Propo- 
sition 2.1). Hence, fp'{m) = na'^{m) / {n— \m\ + 1) is distributed as {a^{m)B / {n- 
\m\))n/{n — \m\ + 1), and 

a'^{m)B n n—\m\ + l B 
n — \m\ n — \m\ + 1 na'^{m) n — \m\ 
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First consider P{fP'{rn){n— |m| + 1) / {na"^ {m)) > exp(t)). By the preceding 
consideration, this probabihty equals 



P 



B 



n 



m 



1 > exp(t) - 1 



Using Lemma A. 2 of [20] [with n — |m| and exp(t) — 1 replacing h and e, 
resp.], we see that the probability in the preceding display is not larger than 



exp 



n 



(e*-l-t) 



Now, Lemma B.2(iii) entails that the expression in the preceding display is 
bounded by exp[— (n — |m|)t^/(2(t + 2))] as required. The derivation of the 
upper bound for P{/P{m){n — \m\ + l)/{na'^{m)) < exp(— i)) is completely 
analogous. Finally, the statement in parentheses follows, because p^{m) and 
6^ [in) are given by the same formula. □ 



Lemma B.5. For each t > 0, we have 

< exp(— t) ) < exp 



P[p{mf- 



and 



„i / -.on — \m\ + 1 
P[p{mf ' ' 



2 1 



> exp(t) < 3 exp 



n — |m| t 
2 t + 2 



n — \m\ t 
4 tTl 



Proof. The first inequality follows immediately from Lemma B.3 upon 
noting that 5^(?n) < 6'^{m) + u'^{m) = p{m)'^. 

The second inequality holds trivially in case the upper bound is larger 
than one. We exclude the trivial case and hence assume that log(3) < 
((n— |m|)/4)t^/(t + 4). For later use, we note that this entails that 1 < nt/2 
[because log(3) < {n/A)t^/{t + 4) < (n/4)t , so that 2 log(3) < nt/2, where the 
lower bound is larger than one]. In the second inequality of the lemma, the 
expression on the left-hand side is bounded by 



P[ 5 



(14) 



n 



\m\ + 1 



[m 



> exp(t/2) 



n 



— \m\ + 1 t , , . 

>-exp(t/2) 



because p'^{m) = u'^{m) + 5^(m) and e* = e*/^e*/^ > e*/^(l +t/2). The first 



term in (14) is bounded by exp[— ((n 
with t/2 replacing t and simplify). 



\m\)/4:)r/{t + 4)] (use Lemma B.3 
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To complete the proof, we need to show that the second term in (14) is 
bounded by 2exp[— ((n — |m|)/4)t^/(t + 4)]. To this end, recall from Proposi- 
tion 2.1 that z^^(m) is distributed as {A/n)6'^{m), where ^ ~ Xi independent 
of 5^(m). Hence, the second term in (14) is bounded by 

p(a> n'-) + P f6\m)'^-M+l > exp(t/2)) . 

The second term in the preceding display coincides with the first term in 
(14) and is bounded by exp[— ((n— |m|)/4)t^/(t + 4)] as shown before. To 
complete the proof, we need to show that the first term is also bounded by 
that quantity. By the second inequality of Lemma B.l, the term in question is 
bounded by y^2/7rexp[— nt/4 — log(nt/2)/2]. Now recall that we have nt/2 > 
1 and note that \/2/it < 1. Hence, the first term in the preceding display is 
bounded by exp[-nt/i] < exp[-((n - \m\) / 4)t'^ / {t + 4)]. □ 



APPENDIX C: PROOFS FOR SECTION 3 



Proof of Theorem 3.1. We first derive separate upper bounds for 
P{p^{m)/ p^{m) < e~^) and for P{p^{m)/ fP'{m) > e^). 
If {m) / fP' {m) < e~^, then either 



n ■ 



\m\ + 1 



< exp(— e/2) or p'^{m) 



n 



\m\ + 1 



> exp(e/2) 



Using Lemma B.5 to bound the probability of the first event in the preceding 
display and using Lemma B.4 to bound the probability of the second one, 
we see that 



p^{m) 



< e" 



< 2exp 



n ■ 



\m\ £ 



2 1 



£ + 4 



Clearly, the upper bound in the preceding display is not larger than 2exp[— ((n- 
|m|)/8)eV(e + 8)]. 

For P {p^ {m) / fP' {m) > e^), we argue similarly as in the preceding para- 
graph to obtain 



P 



p'^{m) 



> < 4 exp 



Relation (4) follows from this. □ 



n 



2 n 



e + i 



Corollary C.l. In the setting of Theorem 3.1, relation (4) continues 
to hold with 6'^{m) replacing p'^{m); in that case, the constants 6 and 8 in 
(4) can both be replaced by 4. 
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Proof. The result follows by arguing as in the proof of Theorem 3.1, 
mutatis mutandis, now using Lemma B.3 instead of Lemma B.5. □ 



Proof of Corollary 3.2. 



max 



log 



Let E denote the event where 



<e/2, 



and note that the complement of that is, E'^, is such that 

fP{ni) ^ _ ^ ^ _ [ n—\m\ (e/2)^ 



< 



log 



< 6#M exp 



\M\ 



n 



>e/2] < 6exp 

.2 



(e/2) + ' 



£ 



16 e + 16 



(In the preceding chain of inequalities, the first one is derived from Bonfer- 
roni's inequality, the second one follows from Theorem 3.1, and the last one 
is obvious in view of the definitions of and |A^|.) 

To derive the first statement of the corollary, first note that the relation 
< log(p^(m)/p^(mp)) is always satisfied. Moreover, observe that 



log 



log ^17^7^+ log ■'^ ^ ^ 



log 



p'^irup) 



p^irup) 



p'^{mp) ° p^{'m) ' ° ^{mp) 

almost surely, because the event where p^{m) > for each m £ Ai has prob- 
ability one. On the right-hand side of the preceding equality, the second 
term is nonpositive, and hence 



log 



p'^irup) 



< 



log 



p^{rh) 



+ 



log 



p2(mp) 



p2(mp) 



almost surely. Hence, on the event E, we see that \og{p^{rh) / p'^{mp)) is 
between zero and e, and P{E^) is bounded from above as required. 

For the second statement of the corollary, define E as before but now 
with e replacing e/2. It is easy to see that now P(E^) is not larger than 
6#>iexp[-((n - \M\)/8)e'^ /{e + 8)]. On the event E, we clearly have 
I log (m) //3^ (rfi) \ <e. □ 



APPENDIX D: PROOFS FOR SECTION 4 

The following lemma provides an upper bound for the total variation 
distance of two normal distributions in terms of their parameters and will 
be instrumental in the proof of Theorem 4.1. We believe that the lemma is 
well known, in some form or another, but we could not find an appropriate 
reference. 
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Lemma D.l. Write iV(a, s^) and iV(0, 1) for the Gaussian measures 
with the indicated parameters (where a € M, s > 0). Then the total variation 
distance of these two measures is bounded as 

V27r v27re 

Remark D.l. Of course \\N{a,s'^) — A^(0,1)||tv is trivially bounded 
by one. Moreover, the lemma also entails that that total variation dis- 
tance is also bounded by \a/s\/V^ + \ log(s^)|/V27re, because ||A^(a,s^) — 
N{0, 1)||tv = ||iV(0, 1) - Ni-a/s, l/s^) 



|TV- 



Proof of Lemma D.l. Recall that the total variation distance of two 
mutually absolutely continuous probability measures P and Q is given by 

(15) IIP - QIItv = P{log{p/q) > 0) - Q{log{p/q) > 0), 

where p and q are the densities of P and Q, respectively, with respect to 
a common dominating sigma-finite measure. Write i^(i) for the Lebesgue 
density of A^(0, 1), and note that the Lebesgue density of N{a,s'^) is then 
given by (j){{t — a)/s)/s. The log-likelihood ratio of N{a,s'^) and A^(0, 1) in 
hence given by 



The total variation distance of iV(a, s^) and A^(0, 1) is bounded from above 
by ||iV(a,s2)-iV(a,l)||TV + ||iV(o,l)-A^(0,l)||TV or, equivalently, by 

(17) ||iV(a, 1) - A^(0, 1)||tv + \\N{0,s^) - N{0, 1)||tv. 

The proof will be complete if we can show that the first term in (17) 
is bounded by |a|/\/27r and that the second term in (17) is bounded by 
|log(s2)|/V2^. 

To bound the first term in (17), we first use (16) with replaced by 
1 to see that the log-likelihood ratio of N{a,l) and A^(0, 1) is given by 
(— 1/2)(— 2ta + a"^), which is positive if and only if ta > a'^ /2. Using (15) 
with N{a,l) and A^(0, 1) replacing P and Q, respectively, it is elementary 
to verify that 

||iV(a,l)-iV(0,l)||TV = 2«>(|a|/2)-l, 

where <!>(•) denotes the standard Gaussian c.d.f. For x > 0, set f{x) = 
2<&(x) — 1 and g{x) = x\/2/tt. If we can show that /(x) < g{x), it will fol- 
low that the expression in the preceding display is bounded from above by 
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g{\a\/2) = |a|/v27r. To show that /(x) < g{x) for x > 0, note that /(O) 
5(0) = 0, and that 



The claim now follows because f{x) = /(O) + /(^ f{t) dt < g{0) + Jq g{t) dt = 

For the second term in (17), note that it suffices to consider the case where 
> 1 [because that term is trivially bounded from above by | log(s^)|/\/27re 



m case s 



1; because ||7V(0, s^) _ Ar(o, 1)||tv = ||iV(0, 1) - A^(0, l/s^ 



|Tv; 



and because |log(s^)| = | log(l/s^)|]. Use (16) with a replaced by to see 
that the log-likehhood ratio of A'"(0, s^) and A^(0, 1) is given by — log(s^)/2 — 
t^(l/s^ — l)/2. This log-likelihood ratio is positive at t if and only if > 
s2log(s2)/(s2 - 1), because > 1. Using (15) with N{0,s^) and iV(0,l) 
replacing P and Q, respectively, it is now easy to see that 



||iV(0,s2)-iV(0,l) 



I TV 



F s 



,log(£2) 
S2-1 



F 



log(g^) 

s2- 1 



where F{-) denotes the c.d.f. of a chi-square distributed random variable 
with one degree of freedom. Using the first inequality of Lemma B.l with 
replacing t, we see that the expression in the preceding display is bounded 
by log(s2)/\/27re as required. □ 



Proof of Theorem 4.1. Because \\N{0,5'^{m))- N{i'{m),6'^{m))\\'TY 
\N{-L'{m)/6{m),6'^{m)/6'^{'m)) - iV(0, 1)||tv, Lemma D.l entails that 

\\NiO,6\m)) - Niuim),6\m))\\TV 

^ \i^{m)/6{m)\ ^ \log{5'^{m)/5'^{m))\ 



/27r V27re 
In view of this, and because l/\/2vre < 1/4, we get 

p(\\N{0,6'^{m))-N{iy{m),6'^{m))\\Ty >^+< 



n 



(18) 



<P 



<P 



<P 



\u{m)/6{m)\ \log{6'^{m)/6'^{m))\ 1 



\i'{m)/d{m)\ 1 £ 
^ ^7^^2 



v{m) 



5{m) 



> V27r 



1 £ 

n^2 



+ P 
+ P 



\og{5'^ (m) / 6"^ {m))\ e 



log 



V27re 



m] 



> 



>2e]. 
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For the second term in (18), recall that 6^{m) = p'^{m) and use Corollary C.l 
to obtain 



log 



>2e] < 4 exp 



n ■ 



e + 2 



To complete the proof, we need to show that the first term in (18) is bounded 
by 3exp[— ((n — |m|)/2)e^/(e + 2)]. For the first term in (18), observe that 



P 



> 2tt 



< P( v\m) 



1 



n 



e 

n^2 



\m\ + 1 



+ P 



1 na'^{m) 



> 2tt 



L £ 



^(52(m) n — \m\ + 1 

In the preceding display, the second term on the right-hand side equals 
P{5'^{m) X (n— \m\ + l)/{na'^{m)) < e~^) < exp[— " ^"^^ i+2]' where inequal- 
ity follows from Lemma B.3. It remains to show that, in the preceding dis- 
play, the first term on the right is not larger than 2exp[— ((n — \m\)/2)e'^ /{e + 
2)]. To this end, let Xi independent of S'^{m). In view of Proposition 2.1, 
v'^im) has the same distribution as {A/n)6'^{m). Therefore, 

2 



P v\m) 



n 



m -|- 1 



(19) 



P 



<P 



( —S{m) 



n 



> 2tt 
— \m\ + 1 



1 e 
n^2 



— > 27r — + - 
\n 2 



> 2tt 



-2e 



1 e 



+ P[5\m 



n 



\m\ + 1 



n(j^{m) 



For the second term on the far right-hand side of (19), we again use 
Lemma B.3 to get 

-2 n 



P[ 5\m) 



n ■ 



m -|- 1 



> < exp 



n ■ 



e + 2 



In view of this, the proof will be complete if the first term on the far right 
of (19) is bounded by exp[-((n - |m|)/2)eV(e + 2)]. 
For the first term on the far right of (19), we have 



(20) 



/A fie 



n 



n 



-2e 



< 



exp 



-vrn 



1 



e 

n^2 



-2e 



in view of the second inequality of Lemma B.l and because 27rn(l/-y/n + 
e/2)2e-2^ = 27r(l + ^e/2fe-'^'' > 27re~^'' > 27re-2i°g2 ^ 27r/4 > 1. We now 
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show that the right-hand side of (20) is not larger than exp[— (n/2)e^/ (e + 2)] 
or, equivalently, that 

(21) 2^f^ + jye-2-> 



n 2J ~ e + 2 

In this inequahty, the left-hand side satisfies 

27r(l/V^ + e/2)^e-2^ > {7r/2)eh-^'. 

Thus, (21) will follow if 7r(e + 2) > 26^^ or if f{e) = 7r(e + 2) - 26^^ > 0. It 
is now easy to verify that f{e) is strictly decreasing and that /(log(2)) > 0. 
Hence, the expression on the left of (20) or, equivalently, the first term 
on the far right of (19) is bounded by exp[— (?i/2)e^/(e + 2)] < exp[— ((n — 
|m|)/2)eV(e + 2)]. □ 

Proof of Corollary 4.2. Using Bonferroni's inequality and Theo- 
rem 4.1, this result follows immediately by arguing as in the first paragraph 
of the proof of Corollary 3.2. □ 

Proof of Proposition 4.3. For measurable ACM, write L,{m;A) 
and L(?fi; A) for the probability of A under L(m) and under L(m), respec- 
tively. We have y^^^ G T{m) if and only if y^f\rh) — y^^^ lies in the inter- 
val [—qaS{rh),qaS{m)]. Writing A as shorthand for that interval, the con- 
ditional coverage probability of X(m) equals L,{rh;A). Because the interval 
I{m) is constructed with nominal coverage probability 1 — a assuming that 
y^f\rh) — y^f^ is distributed as ]L(m), we have ]L(m; A) = 1 — a. The result 
now follows immediately from Corollary 4.2. □ 

Proof of Proposition 4.4. To bound P{logS{rh)/6{ms) > e), we first 
note that 

(52(m) 6'^{m) S'^{ms) d^{ms) 

T27 — T = log J- — - + log — 7 < log — r 
o^{ms) 5^{ms) 0^{ms) 0^{ms) 

almost surely, because m is a minimizer of (j^(-) = p'^{-), and because the 
event where 5^(m) > for each A4 has probability one. Hence, P(log5(m)/ 
dims) > e) or, equivalently, P {log d'^{m)/ 6'^ (ms) > 2e), is bounded by 

in view of Bonferroni's inequality. 

Similarly, to bound P {log 6 (tti)/ 5 (ms) < —e), we observe that 

log ^TT^ = i°g 72M + :^27^ ^ i°g irri ' 

0^(7715) d^[m) d'^{ms) o^ijn) 
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because ms is a minimizer of (5^(-). Arguing similarly as in the preceding 
paragraph, we thus see that P {log 6'^ (in) /S'^{ms) < — 2e) is bounded from 
above by 




Adding the bounds for P(log5(m)/5(m5) > e) and for P(log(5(m)/(5(m5) < 
—e) obtained so far, we see that P(| log 6 (m)/ 5 {ms)\ > e) is bounded by 



Recalling that =/5^(-), the result now follows from Corollary C.l. □ 
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